The tidyverse is “an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.”
About a dozen packages that make up the official tidyverse; however, there are dozens of tidyverse-adjacent packages that follow this philosophy, grammar, and data structures and work well with the official tidyverse packages.
tidy data: a data frame where each row is an observation and each column contains the value of a single variable
pipe operator: %>%, channels the flow of tidy operations
uniform standards for package organization, testing procedures, coding style, etc.
Ease of use, path of least resistance
“Programs must be written for people to read and only incidentally for machines to execute”
Parsimony, simplicitiy
A tibble, or tbl_df, is a modern reimagining of the data.frame, keeping what time has proven to be effective, and throwing out what is not. Tibbles are data.frames that are lazy and surly: they do less (i.e. they don’t change variable names or types, and don’t do partial matching) and complain more (e.g. when a variable does not exist). This forces you to confront problems earlier, typically leading to cleaner, more expressive code. Tibbles also have an enhanced print() method which makes them easier to use with large datasets containing complex objects.
row.names() for a tibble - Tidy data requires that variables be stored in a consistent way, removing the need for row names.# A tibble: 150 × 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
# ℹ 140 more rows
You can create a tibble just like a dataframe
Pipes are written as %>%and they should be read as “and then”.
mutate() can be used to create variables based on existing variables from the dataset.
Ozone Solar.R Wind Temp Month Day totalDays
1 41 190 7.4 67 5 1 153.5
2 36 118 8.0 72 5 2 154.5
3 12 149 12.6 74 5 3 155.5
4 18 313 11.5 62 5 4 156.5
5 NA NA 14.3 56 5 5 157.5
6 28 NA 14.9 66 5 6 158.5
7 23 299 8.6 65 5 7 159.5
8 19 99 13.8 59 5 8 160.5
9 8 19 20.1 61 5 9 161.5
10 NA 194 8.6 69 5 10 162.5
11 7 NA 6.9 74 5 11 163.5
12 16 256 9.7 69 5 12 164.5
13 11 290 9.2 66 5 13 165.5
14 14 274 10.9 68 5 14 166.5
15 18 65 13.2 58 5 15 167.5
16 14 334 11.5 64 5 16 168.5
17 34 307 12.0 66 5 17 169.5
18 6 78 18.4 57 5 18 170.5
19 30 322 11.5 68 5 19 171.5
20 11 44 9.7 62 5 20 172.5
21 1 8 9.7 59 5 21 173.5
22 11 320 16.6 73 5 22 174.5
23 4 25 9.7 61 5 23 175.5
24 32 92 12.0 61 5 24 176.5
25 NA 66 16.6 57 5 25 177.5
26 NA 266 14.9 58 5 26 178.5
27 NA NA 8.0 57 5 27 179.5
28 23 13 12.0 67 5 28 180.5
29 45 252 14.9 81 5 29 181.5
30 115 223 5.7 79 5 30 182.5
31 37 279 7.4 76 5 31 183.5
32 NA 286 8.6 78 6 1 184.0
33 NA 287 9.7 74 6 2 185.0
34 NA 242 16.1 67 6 3 186.0
35 NA 186 9.2 84 6 4 187.0
36 NA 220 8.6 85 6 5 188.0
37 NA 264 14.3 79 6 6 189.0
38 29 127 9.7 82 6 7 190.0
39 NA 273 6.9 87 6 8 191.0
40 71 291 13.8 90 6 9 192.0
41 39 323 11.5 87 6 10 193.0
42 NA 259 10.9 93 6 11 194.0
43 NA 250 9.2 92 6 12 195.0
44 23 148 8.0 82 6 13 196.0
45 NA 332 13.8 80 6 14 197.0
46 NA 322 11.5 79 6 15 198.0
47 21 191 14.9 77 6 16 199.0
48 37 284 20.7 72 6 17 200.0
49 20 37 9.2 65 6 18 201.0
50 12 120 11.5 73 6 19 202.0
51 13 137 10.3 76 6 20 203.0
52 NA 150 6.3 77 6 21 204.0
53 NA 59 1.7 76 6 22 205.0
54 NA 91 4.6 76 6 23 206.0
55 NA 250 6.3 76 6 24 207.0
56 NA 135 8.0 75 6 25 208.0
57 NA 127 8.0 78 6 26 209.0
58 NA 47 10.3 73 6 27 210.0
59 NA 98 11.5 80 6 28 211.0
60 NA 31 14.9 77 6 29 212.0
61 NA 138 8.0 83 6 30 213.0
62 135 269 4.1 84 7 1 214.5
63 49 248 9.2 85 7 2 215.5
64 32 236 9.2 81 7 3 216.5
65 NA 101 10.9 84 7 4 217.5
66 64 175 4.6 83 7 5 218.5
67 40 314 10.9 83 7 6 219.5
68 77 276 5.1 88 7 7 220.5
69 97 267 6.3 92 7 8 221.5
70 97 272 5.7 92 7 9 222.5
71 85 175 7.4 89 7 10 223.5
72 NA 139 8.6 82 7 11 224.5
73 10 264 14.3 73 7 12 225.5
74 27 175 14.9 81 7 13 226.5
75 NA 291 14.9 91 7 14 227.5
76 7 48 14.3 80 7 15 228.5
77 48 260 6.9 81 7 16 229.5
78 35 274 10.3 82 7 17 230.5
79 61 285 6.3 84 7 18 231.5
80 79 187 5.1 87 7 19 232.5
81 63 220 11.5 85 7 20 233.5
82 16 7 6.9 74 7 21 234.5
83 NA 258 9.7 81 7 22 235.5
84 NA 295 11.5 82 7 23 236.5
85 80 294 8.6 86 7 24 237.5
86 108 223 8.0 85 7 25 238.5
87 20 81 8.6 82 7 26 239.5
88 52 82 12.0 86 7 27 240.5
89 82 213 7.4 88 7 28 241.5
90 50 275 7.4 86 7 29 242.5
91 64 253 7.4 83 7 30 243.5
92 59 254 9.2 81 7 31 244.5
93 39 83 6.9 81 8 1 245.0
94 9 24 13.8 81 8 2 246.0
95 16 77 7.4 82 8 3 247.0
96 78 NA 6.9 86 8 4 248.0
97 35 NA 7.4 85 8 5 249.0
98 66 NA 4.6 87 8 6 250.0
99 122 255 4.0 89 8 7 251.0
100 89 229 10.3 90 8 8 252.0
101 110 207 8.0 90 8 9 253.0
102 NA 222 8.6 92 8 10 254.0
103 NA 137 11.5 86 8 11 255.0
104 44 192 11.5 86 8 12 256.0
105 28 273 11.5 82 8 13 257.0
106 65 157 9.7 80 8 14 258.0
107 NA 64 11.5 79 8 15 259.0
108 22 71 10.3 77 8 16 260.0
109 59 51 6.3 79 8 17 261.0
110 23 115 7.4 76 8 18 262.0
111 31 244 10.9 78 8 19 263.0
112 44 190 10.3 78 8 20 264.0
113 21 259 15.5 77 8 21 265.0
114 9 36 14.3 72 8 22 266.0
115 NA 255 12.6 75 8 23 267.0
116 45 212 9.7 79 8 24 268.0
117 168 238 3.4 81 8 25 269.0
118 73 215 8.0 86 8 26 270.0
119 NA 153 5.7 88 8 27 271.0
120 76 203 9.7 97 8 28 272.0
121 118 225 2.3 94 8 29 273.0
122 84 237 6.3 96 8 30 274.0
123 85 188 6.3 94 8 31 275.0
124 96 167 6.9 91 9 1 275.5
125 78 197 5.1 92 9 2 276.5
126 73 183 2.8 93 9 3 277.5
127 91 189 4.6 93 9 4 278.5
128 47 95 7.4 87 9 5 279.5
129 32 92 15.5 84 9 6 280.5
130 20 252 10.9 80 9 7 281.5
131 23 220 10.3 78 9 8 282.5
132 21 230 10.9 75 9 9 283.5
133 24 259 9.7 73 9 10 284.5
134 44 236 14.9 81 9 11 285.5
135 21 259 15.5 76 9 12 286.5
136 28 238 6.3 77 9 13 287.5
137 9 24 10.9 71 9 14 288.5
138 13 112 11.5 71 9 15 289.5
139 46 237 6.9 78 9 16 290.5
140 18 224 13.8 67 9 17 291.5
141 13 27 10.3 76 9 18 292.5
142 24 238 10.3 68 9 19 293.5
143 16 201 8.0 82 9 20 294.5
144 13 238 12.6 64 9 21 295.5
145 23 14 9.2 71 9 22 296.5
146 36 139 10.3 81 9 23 297.5
147 7 49 10.3 69 9 24 298.5
148 14 20 16.6 63 9 25 299.5
149 30 193 6.9 70 9 26 300.5
150 NA 145 13.2 77 9 27 301.5
151 14 191 14.3 75 9 28 302.5
152 18 131 8.0 76 9 29 303.5
153 20 223 11.5 68 9 30 304.5
Another example
# A tibble: 53,940 × 11
carat cut color clarity depth table price x y z price200
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 126
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 126
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 127
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63 134
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 135
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48 136
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47 136
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53 137
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49 137
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39 138
# ℹ 53,930 more rows
diamonds %>%
mutate(price200 = price - 200, # $200 OFF from the original price
price20perc = price * 0.20, # 20% of the original price
price20percoff = price * 0.80, # 20% OFF from the original price
pricepercarat = price / carat, # ratio of price to carat
pizza = depth ^ 2) # Square the original depth# A tibble: 53,940 × 15
carat cut color clarity depth table price x y z price200
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 126
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 126
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 127
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63 134
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 135
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48 136
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47 136
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53 137
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49 137
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39 138
# ℹ 53,930 more rows
# ℹ 4 more variables: price20perc <dbl>, price20percoff <dbl>,
# pricepercarat <dbl>, pizza <dbl>
# A tibble: 53,940 × 11
carat cut color clarity depth table price x y z m
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 3933.
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 3933.
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 3933.
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63 3933.
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 3933.
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48 3933.
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47 3933.
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53 3933.
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49 3933.
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39 3933.
# ℹ 53,930 more rows
diamonds %>%
mutate(m = mean(price), # calculates the mean price
sd = sd(price), # calculates standard deviation
med = median(price)) # calculates the median price# A tibble: 53,940 × 13
carat cut color clarity depth table price x y z m sd
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 3933. 3989.
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 3933. 3989.
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 3933. 3989.
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63 3933. 3989.
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 3933. 3989.
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48 3933. 3989.
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47 3933. 3989.
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53 3933. 3989.
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49 3933. 3989.
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39 3933. 3989.
# ℹ 53,930 more rows
# ℹ 1 more variable: med <dbl>
modifies the values within a variable. Below, “Ideal” is now “IDEAL”
# A tibble: 53,940 × 11
carat cut color clarity depth table price x y z cut.new
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <ord>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 IDEAL
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 Premium
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 Good
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63 Premium
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 Good
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48 Very Good
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47 Very Good
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53 Very Good
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49 Fair
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39 Very Good
# ℹ 53,930 more rows
# A tibble: 53,940 × 11
carat cut color clarity depth table price x y z cut.new
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <ord>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 IDEAL
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 pizza
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 Good
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63 pizza
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 Good
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48 Very Good
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47 Very Good
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53 Very Good
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49 Okay
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39 Very Good
# ℹ 53,930 more rows
collapses all rows and returns a one-row summary. R will recognize both the British and American spelling (summarise/summarize).
Similar to mutate(), we can also perform multiple operations with summarize() and nest other useful functions inside it:
diamonds %>%
summarize(avg.price = mean(price), # average price of all diamonds
dbl.price = mean(price) * 2, # calculating double the average price
random.add = 1 + 2, # a math operation without an existing variable
avg.carat = mean(carat), # average carat size of all diamonds
stdev.price = sd(price)) # calculating the standard deviation # A tibble: 1 × 5
avg.price dbl.price random.add avg.carat stdev.price
<dbl> <dbl> <dbl> <dbl> <dbl>
1 3933. 7866. 3 0.798 3989.
# A tibble: 32 × 12
mpg cyl disp hp drat wt qsec vs am gear carb aveMPG
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 24.5
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4 24.5
3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 24.5
4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 16.1
5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2 16.1
6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1 16.1
7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4 16.1
8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2 24.5
9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2 24.5
10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4 24.5
# ℹ 22 more rows
If you forget to ungroup(), future calculations on that object will be grouped!
## Creating identification number to represent 50 individual people
ID <- c(1:50)
## Creating sex variable (25 males/25 females)
Sex <- rep(c("male", "female"), 25) # rep stands for replicate
## Creating age variable (20-39 year olds)
Age <- c(26, 25, 39, 37, 31, 34, 34, 30, 26, 33,
39, 28, 26, 29, 33, 22, 35, 23, 26, 36,
21, 20, 31, 21, 35, 39, 36, 22, 22, 25,
27, 30, 26, 34, 38, 39, 30, 29, 26, 25,
26, 36, 23, 21, 21, 39, 26, 26, 27, 21)
## Creating a dependent variable called Score
Score <- c(0.010, 0.418, 0.014, 0.090, 0.061, 0.328, 0.656, 0.002, 0.639, 0.173,
0.076, 0.152, 0.467, 0.186, 0.520, 0.493, 0.388, 0.501, 0.800, 0.482,
0.384, 0.046, 0.920, 0.865, 0.625, 0.035, 0.501, 0.851, 0.285, 0.752,
0.686, 0.339, 0.710, 0.665, 0.214, 0.560, 0.287, 0.665, 0.630, 0.567,
0.812, 0.637, 0.772, 0.905, 0.405, 0.363, 0.773, 0.410, 0.535, 0.449)
## Creating a unified dataset that puts together all variables
data <- tibble(ID, Sex, Age, Score)Another example
Only retain specific rows of data that meet the specified requirement(s).
# A tibble: 1,610 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
2 0.86 Fair E SI2 55.1 69 2757 6.45 6.33 3.52
3 0.96 Fair F SI2 66.3 62 2759 6.27 5.95 4.07
4 0.7 Fair F VS2 64.5 57 2762 5.57 5.53 3.58
5 0.7 Fair F VS2 65.3 55 2762 5.63 5.58 3.66
6 0.91 Fair H SI2 64.4 57 2763 6.11 6.09 3.93
7 0.91 Fair H SI2 65.7 60 2763 6.03 5.99 3.95
8 0.98 Fair H SI2 67.9 60 2777 6.05 5.97 4.08
9 0.84 Fair G SI1 55.1 67 2782 6.39 6.2 3.47
10 1.01 Fair E I1 64.5 58 2788 6.29 6.21 4.03
# ℹ 1,600 more rows
# A tibble: 505 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
2 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
3 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
4 0.3 Good J SI1 64 55 339 4.25 4.28 2.73
5 0.3 Good J SI1 63.4 54 351 4.23 4.29 2.7
6 0.3 Good J SI1 63.8 56 351 4.23 4.26 2.71
7 0.3 Good I SI2 63.3 56 351 4.26 4.3 2.71
8 0.23 Good F VS1 58.2 59 402 4.06 4.08 2.37
9 0.23 Good E VS1 64.1 59 402 3.83 3.85 2.46
10 0.31 Good H SI1 64 54 402 4.29 4.31 2.75
# ℹ 495 more rows
# A tibble: 505 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
2 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
3 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
4 0.3 Good J SI1 64 55 339 4.25 4.28 2.73
5 0.3 Good J SI1 63.4 54 351 4.23 4.29 2.7
6 0.3 Good J SI1 63.8 56 351 4.23 4.26 2.71
7 0.3 Good I SI2 63.3 56 351 4.26 4.3 2.71
8 0.23 Good F VS1 58.2 59 402 4.06 4.08 2.37
9 0.23 Good E VS1 64.1 59 402 3.83 3.85 2.46
10 0.31 Good H SI1 64 54 402 4.29 4.31 2.75
# ℹ 495 more rows
This is effectively identical to subset in base R, except for some minute differences.
# A tibble: 53,940 × 5
carat cut color clarity depth
<dbl> <ord> <ord> <ord> <dbl>
1 0.23 Ideal E SI2 61.5
2 0.21 Premium E SI1 59.8
3 0.23 Good E VS1 56.9
4 0.29 Premium I VS2 62.4
5 0.31 Good J SI2 63.3
6 0.24 Very Good J VVS2 62.8
7 0.24 Very Good I VVS1 62.3
8 0.26 Very Good H SI1 61.9
9 0.22 Fair E VS2 65.1
10 0.23 Very Good H VS1 59.4
# ℹ 53,930 more rows
# A tibble: 53,940 × 5
carat cut color clarity depth
<dbl> <ord> <ord> <ord> <dbl>
1 0.23 Ideal E SI2 61.5
2 0.21 Premium E SI1 59.8
3 0.23 Good E VS1 56.9
4 0.29 Premium I VS2 62.4
5 0.31 Good J SI2 63.3
6 0.24 Very Good J VVS2 62.8
7 0.24 Very Good I VVS1 62.3
8 0.26 Very Good H SI1 61.9
9 0.22 Fair E VS2 65.1
10 0.23 Very Good H VS1 59.4
# ℹ 53,930 more rows
# A tibble: 53,940 × 9
carat color clarity depth table price x y z
<dbl> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 E VS1 56.9 65 327 4.05 4.07 2.31
4 0.29 I VS2 62.4 58 334 4.2 4.23 2.63
5 0.31 J SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 J VVS2 62.8 57 336 3.94 3.96 2.48
7 0.24 I VVS1 62.3 57 336 3.95 3.98 2.47
8 0.26 H SI1 61.9 55 337 4.07 4.11 2.53
9 0.22 E VS2 65.1 61 337 3.87 3.78 2.49
10 0.23 H VS1 59.4 61 338 4 4.05 2.39
# ℹ 53,930 more rows
# A tibble: 53,940 × 8
carat clarity depth table price x y z
<dbl> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 VS1 56.9 65 327 4.05 4.07 2.31
4 0.29 VS2 62.4 58 334 4.2 4.23 2.63
5 0.31 SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 VVS2 62.8 57 336 3.94 3.96 2.48
7 0.24 VVS1 62.3 57 336 3.95 3.98 2.47
8 0.26 SI1 61.9 55 337 4.07 4.11 2.53
9 0.22 VS2 65.1 61 337 3.87 3.78 2.49
10 0.23 VS1 59.4 61 338 4 4.05 2.39
# ℹ 53,930 more rows
# A tibble: 53,940 × 5
table price x y z
<dbl> <int> <dbl> <dbl> <dbl>
1 55 326 3.95 3.98 2.43
2 61 326 3.89 3.84 2.31
3 65 327 4.05 4.07 2.31
4 58 334 4.2 4.23 2.63
5 58 335 4.34 4.35 2.75
6 57 336 3.94 3.96 2.48
7 57 336 3.95 3.98 2.47
8 55 337 4.07 4.11 2.53
9 61 337 3.87 3.78 2.49
10 61 338 4 4.05 2.39
# ℹ 53,930 more rows
# A tibble: 53,940 × 5
table price x y z
<dbl> <int> <dbl> <dbl> <dbl>
1 55 326 3.95 3.98 2.43
2 61 326 3.89 3.84 2.31
3 65 327 4.05 4.07 2.31
4 58 334 4.2 4.23 2.63
5 58 335 4.34 4.35 2.75
6 57 336 3.94 3.96 2.48
7 57 336 3.95 3.98 2.47
8 55 337 4.07 4.11 2.53
9 61 337 3.87 3.78 2.49
10 61 338 4 4.05 2.39
# ℹ 53,930 more rows
# A tibble: 53,940 × 10
x y z carat cut color clarity depth table price
<dbl> <dbl> <dbl> <dbl> <ord> <ord> <ord> <dbl> <dbl> <int>
1 3.95 3.98 2.43 0.23 Ideal E SI2 61.5 55 326
2 3.89 3.84 2.31 0.21 Premium E SI1 59.8 61 326
3 4.05 4.07 2.31 0.23 Good E VS1 56.9 65 327
4 4.2 4.23 2.63 0.29 Premium I VS2 62.4 58 334
5 4.34 4.35 2.75 0.31 Good J SI2 63.3 58 335
6 3.94 3.96 2.48 0.24 Very Good J VVS2 62.8 57 336
7 3.95 3.98 2.47 0.24 Very Good I VVS1 62.3 57 336
8 4.07 4.11 2.53 0.26 Very Good H SI1 61.9 55 337
9 3.87 3.78 2.49 0.22 Fair E VS2 65.1 61 337
10 4 4.05 2.39 0.23 Very Good H VS1 59.4 61 338
# ℹ 53,930 more rows
# A tibble: 53,940 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
2 0.86 Fair E SI2 55.1 69 2757 6.45 6.33 3.52
3 0.96 Fair F SI2 66.3 62 2759 6.27 5.95 4.07
4 0.7 Fair F VS2 64.5 57 2762 5.57 5.53 3.58
5 0.7 Fair F VS2 65.3 55 2762 5.63 5.58 3.66
6 0.91 Fair H SI2 64.4 57 2763 6.11 6.09 3.93
7 0.91 Fair H SI2 65.7 60 2763 6.03 5.99 3.95
8 0.98 Fair H SI2 67.9 60 2777 6.05 5.97 4.08
9 0.84 Fair G SI1 55.1 67 2782 6.39 6.2 3.47
10 1.01 Fair E I1 64.5 58 2788 6.29 6.21 4.03
# ℹ 53,930 more rows
# A tibble: 53,940 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
# ℹ 53,930 more rows
# A tibble: 53,940 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.23 Ideal J VS1 62.8 56 340 3.93 3.9 2.46
3 0.31 Ideal J SI2 62.2 54 344 4.35 4.37 2.71
4 0.3 Ideal I SI2 62 54 348 4.31 4.34 2.68
5 0.33 Ideal I SI2 61.8 55 403 4.49 4.51 2.78
6 0.33 Ideal I SI2 61.2 56 403 4.49 4.5 2.75
7 0.33 Ideal J SI1 61.1 56 403 4.49 4.55 2.76
8 0.23 Ideal G VS1 61.9 54 404 3.93 3.95 2.44
9 0.32 Ideal I SI1 60.9 55 404 4.45 4.48 2.72
10 0.3 Ideal I SI2 61 59 405 4.3 4.33 2.63
# ℹ 53,930 more rows
# A tibble: 53,940 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 2.29 Premium I VS2 60.8 60 18823 8.5 8.47 5.16
2 2 Very Good G SI1 63.5 56 18818 7.9 7.97 5.04
3 1.51 Ideal G IF 61.7 55 18806 7.37 7.41 4.56
4 2.07 Ideal G SI2 62.5 55 18804 8.2 8.13 5.11
5 2 Very Good H SI1 62.8 57 18803 7.95 8 5.01
6 2.29 Premium I SI1 61.8 59 18797 8.52 8.45 5.24
7 2.04 Premium H SI1 58.1 60 18795 8.37 8.28 4.84
8 2 Premium I VS1 60.8 59 18795 8.13 8.02 4.91
9 1.71 Premium F VS2 62.3 59 18791 7.57 7.53 4.7
10 2.15 Ideal G SI2 62.6 54 18791 8.29 8.35 5.21
# ℹ 53,930 more rows
Can you explain each step of this function?
diamonds %>%
group_by(color, clarity) %>%
mutate(price200 = mean(price)) %>%
ungroup() %>%
mutate(random10 = 10 + price) %>%
select(cut, color,
clarity, price,
price200, random10) %>%
arrange(color) %>%
group_by(cut) %>%
mutate(dis = n_distinct(price),
rowID = row_number()) %>%
ungroup() # A tibble: 53,940 × 8
cut color clarity price price200 random10 dis rowID
<ord> <ord> <ord> <int> <dbl> <dbl> <int> <int>
1 Very Good D VS2 357 2587. 367 5840 1
2 Very Good D VS1 402 3030. 412 5840 2
3 Very Good D VS2 403 2587. 413 5840 3
4 Good D VS2 403 2587. 413 3086 1
5 Good D VS1 403 3030. 413 3086 2
6 Premium D VS2 404 2587. 414 6014 1
7 Premium D SI1 552 2976. 562 6014 2
8 Ideal D SI1 552 2976. 562 7281 1
9 Ideal D SI1 552 2976. 562 7281 2
10 Very Good D VVS1 553 2948. 563 5840 4
# ℹ 53,930 more rows
diamonds %>% # utilizes the diamonds dataset
group_by(color, clarity) %>% # groups data by color and clarity variables
mutate(price200 = mean(price)) %>% # creates new variable (average price by groups)
ungroup() %>% # data no longer grouped by color and clarity
mutate(random10 = 10 + price) %>% # new variable, original price + $10
select(cut, color, # retain only these listed columns
clarity, price,
price200, random10) %>%
arrange(color) %>% # visualize data ordered by color
group_by(cut) %>% # group data by cut
mutate(dis = n_distinct(price), # counts the number of unique price values per cut
rowID = row_number()) %>% # numbers each row consecutively for each cut
ungroup() # final ungrouping of data# A tibble: 53,940 × 8
cut color clarity price price200 random10 dis rowID
<ord> <ord> <ord> <int> <dbl> <dbl> <int> <int>
1 Very Good D VS2 357 2587. 367 5840 1
2 Very Good D VS1 402 3030. 412 5840 2
3 Very Good D VS2 403 2587. 413 5840 3
4 Good D VS2 403 2587. 413 3086 1
5 Good D VS1 403 3030. 413 3086 2
6 Premium D VS2 404 2587. 414 6014 1
7 Premium D SI1 552 2976. 562 6014 2
8 Ideal D SI1 552 2976. 562 7281 1
9 Ideal D SI1 552 2976. 562 7281 2
10 Very Good D VVS1 553 2948. 563 5840 4
# ℹ 53,930 more rows
midwest %>%
group_by(state) %>%
summarize(poptotalmean = mean(poptotal),
poptotalmed = median(poptotal),
popmax = max(poptotal),
popmin = min(poptotal),
popdistinct = n_distinct(poptotal),
popfirst = first(poptotal),
popany = any(poptotal < 5000),
popany2 = any(poptotal > 2000000)) %>%
ungroup()# A tibble: 5 × 9
state poptotalmean poptotalmed popmax popmin popdistinct popfirst popany
<chr> <dbl> <dbl> <int> <int> <int> <int> <lgl>
1 IL 112065. 24486. 5105067 4373 101 66090 TRUE
2 IN 60263. 30362. 797159 5315 92 31095 FALSE
3 MI 111992. 37308 2111687 1701 83 10145 TRUE
4 OH 123263. 54930. 1412140 11098 88 25371 FALSE
5 WI 67941. 33528 959275 3890 72 15682 TRUE
# ℹ 1 more variable: popany2 <lgl>
Collapses the rows and counts the number of observations per group of values.
# A tibble: 5 × 2
cut n
<ord> <int>
1 Fair 1610
2 Good 4906
3 Very Good 12082
4 Premium 13791
5 Ideal 21551
# A tibble: 5 × 2
# Groups: cut [5]
cut n
<ord> <int>
1 Fair 1610
2 Good 4906
3 Very Good 12082
4 Premium 13791
5 Ideal 21551
# A tibble: 5 × 2
cut n
<ord> <int>
1 Fair 1610
2 Good 4906
3 Very Good 12082
4 Premium 13791
5 Ideal 21551
# A tibble: 40 × 3
cut clarity n
<ord> <ord> <int>
1 Fair I1 210
2 Fair SI2 466
3 Fair SI1 408
4 Fair VS2 261
5 Fair VS1 170
6 Fair VVS2 69
7 Fair VVS1 17
8 Fair IF 9
9 Good I1 96
10 Good SI2 1081
# ℹ 30 more rows
# A tibble: 40 × 3
# Groups: cut, clarity [40]
cut clarity n
<ord> <ord> <int>
1 Fair I1 210
2 Fair SI2 466
3 Fair SI1 408
4 Fair VS2 261
5 Fair VS1 170
6 Fair VVS2 69
7 Fair VVS1 17
8 Fair IF 9
9 Good I1 96
10 Good SI2 1081
# ℹ 30 more rows
# A tibble: 40 × 3
# Groups: cut [5]
cut clarity n
<ord> <ord> <int>
1 Fair I1 210
2 Fair SI2 466
3 Fair SI1 408
4 Fair VS2 261
5 Fair VS1 170
6 Fair VVS2 69
7 Fair VVS1 17
8 Fair IF 9
9 Good I1 96
10 Good SI2 1081
# ℹ 30 more rows
## download file
conservation <- read_csv("https://raw.githubusercontent.com/suzanbaert/Dplyr_Tutorials/master/conservation_explanation.csv")
## take a look at this file
conservation# A tibble: 11 × 1
`conservation abbreviation`
<chr>
1 EX = Extinct
2 EW = Extinct in the wild
3 CR = Critically Endangered
4 EN = Endangered
5 VU = Vulnerable
6 NT = Near Threatened
7 LC = Least Concern
8 DD = Data deficient
9 NE = Not evaluated
10 PE = Probably extinct (informal)
11 PEW = Probably extinct in the wild (informal)
conservation %>%
separate(`conservation abbreviation`,
into = c("abbreviation", "description"), sep = " = ")# A tibble: 11 × 2
abbreviation description
<chr> <chr>
1 EX Extinct
2 EW Extinct in the wild
3 CR Critically Endangered
4 EN Endangered
5 VU Vulnerable
6 NT Near Threatened
7 LC Least Concern
8 DD Data deficient
9 NE Not evaluated
10 PE Probably extinct (informal)
11 PEW Probably extinct in the wild (informal)
conservation %>%
separate(`conservation abbreviation`,
into = c("abbreviation", "description"), sep = " = ") %>%
unite(united_col, abbreviation, description, sep = " = ")# A tibble: 11 × 1
united_col
<chr>
1 EX = Extinct
2 EW = Extinct in the wild
3 CR = Critically Endangered
4 EN = Endangered
5 VU = Vulnerable
6 NT = Near Threatened
7 LC = Least Concern
8 DD = Data deficient
9 NE = Not evaluated
10 PE = Probably extinct (informal)
11 PEW = Probably extinct in the wild (informal)
This function takes the existing column names of your dataset, converts them all to lowercase letters and numbers, and separates all words using the underscore character. For example, there is a space in the column name for conservation.
# A tibble: 11 × 1
conservation_abbreviation
<chr>
1 EX = Extinct
2 EW = Extinct in the wild
3 CR = Critically Endangered
4 EN = Endangered
5 VU = Vulnerable
6 NT = Near Threatened
7 LC = Least Concern
8 DD = Data deficient
9 NE = Not evaluated
10 PE = Probably extinct (informal)
11 PEW = Probably extinct in the wild (informal)
# A tibble: 53,940 × 10
carat cut color clarity depth table PRICE x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
# ℹ 53,930 more rows
# is the same as
diamonds %>%
mutate(PRICE = price) %>% # creates new variable based on old variable
select(-price) # removes old variable from dataset# A tibble: 53,940 × 10
carat cut color clarity depth table x y z PRICE
<dbl> <ord> <ord> <ord> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
1 0.23 Ideal E SI2 61.5 55 3.95 3.98 2.43 326
2 0.21 Premium E SI1 59.8 61 3.89 3.84 2.31 326
3 0.23 Good E VS1 56.9 65 4.05 4.07 2.31 327
4 0.29 Premium I VS2 62.4 58 4.2 4.23 2.63 334
5 0.31 Good J SI2 63.3 58 4.34 4.35 2.75 335
6 0.24 Very Good J VVS2 62.8 57 3.94 3.96 2.48 336
7 0.24 Very Good I VVS1 62.3 57 3.95 3.98 2.47 336
8 0.26 Very Good H SI1 61.9 55 4.07 4.11 2.53 337
9 0.22 Fair E VS2 65.1 61 3.87 3.78 2.49 337
10 0.23 Very Good H VS1 59.4 61 4 4.05 2.39 338
# ℹ 53,930 more rows
# A tibble: 53,940 × 10
carat cut color clarity depth table price Length Width Depth
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
# ℹ 53,930 more rows
# A tibble: 53,940 × 10
carat cut color clarity depth table price Length Width Depth
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
# ℹ 53,930 more rows
A vectorized form of if else
x <- 1:70
case_when(
x %% 35 == 0 ~ "fizz buzz",
x %% 5 == 0 ~ "fizz",
x %% 7 == 0 ~ "buzz",
.default = as.character(x)
) [1] "1" "2" "3" "4" "fizz" "6"
[7] "buzz" "8" "9" "fizz" "11" "12"
[13] "13" "buzz" "fizz" "16" "17" "18"
[19] "19" "fizz" "buzz" "22" "23" "24"
[25] "fizz" "26" "27" "buzz" "29" "fizz"
[31] "31" "32" "33" "34" "fizz buzz" "36"
[37] "37" "38" "39" "fizz" "41" "buzz"
[43] "43" "44" "fizz" "46" "47" "48"
[49] "buzz" "fizz" "51" "52" "53" "54"
[55] "fizz" "buzz" "57" "58" "59" "fizz"
[61] "61" "62" "buzz" "64" "fizz" "66"
[67] "67" "68" "69" "fizz buzz"
starwars %>%
select(name:mass, gender, species) %>%
mutate(
type = case_when(
height > 200 | mass > 200 ~ "large",
species == "Droid" ~ "robot",
.default = "other"
)
)# A tibble: 87 × 6
name height mass gender species type
<chr> <int> <dbl> <chr> <chr> <chr>
1 Luke Skywalker 172 77 masculine Human other
2 C-3PO 167 75 masculine Droid robot
3 R2-D2 96 32 masculine Droid robot
4 Darth Vader 202 136 masculine Human large
5 Leia Organa 150 49 feminine Human other
6 Owen Lars 178 120 masculine Human other
7 Beru Whitesun lars 165 75 feminine Human other
8 R5-D4 97 32 masculine Droid robot
9 Biggs Darklighter 183 84 masculine Human other
10 Obi-Wan Kenobi 182 77 masculine Human other
# ℹ 77 more rows
#create data frame
df <- data.frame(player = c('AJ', 'Bob', 'Chad', 'Dan', 'Eric', 'Frank'),
position = c('G', 'F', 'F', 'G', 'C', NA),
points = c(12, 15, 19, 22, 32, NA),
assists = c(5, 7, 7, 12, 11, NA))
#view data frame
df player position points assists
1 AJ G 12 5
2 Bob F 15 7
3 Chad F 19 7
4 Dan G 22 12
5 Eric C 32 11
6 Frank <NA> NA NA
## take conservation dataset and separate information
## into two columns
## call that new object `conserve`
conserve <- conservation %>%
separate(`conservation abbreviation`,
into = c("abbreviation", "description"), sep = " = ")
conserve# A tibble: 11 × 2
abbreviation description
<chr> <chr>
1 EX Extinct
2 EW Extinct in the wild
3 CR Critically Endangered
4 EN Endangered
5 VU Vulnerable
6 NT Near Threatened
7 LC Least Concern
8 DD Data deficient
9 NE Not evaluated
10 PE Probably extinct (informal)
11 PEW Probably extinct in the wild (informal)
## now lets join the two datasets together
msleep %>%
mutate(conservation = toupper(conservation)) %>%
left_join(conserve, by = c("conservation" = "abbreviation"))# A tibble: 83 × 12
name genus vore order conservation sleep_total sleep_rem sleep_cycle awake
<chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Cheet… Acin… carni Carn… LC 12.1 NA NA 11.9
2 Owl m… Aotus omni Prim… <NA> 17 1.8 NA 7
3 Mount… Aplo… herbi Rode… NT 14.4 2.4 NA 9.6
4 Great… Blar… omni Sori… LC 14.9 2.3 0.133 9.1
5 Cow Bos herbi Arti… DOMESTICATED 4 0.7 0.667 20
6 Three… Brad… herbi Pilo… <NA> 14.4 2.2 0.767 9.6
7 North… Call… carni Carn… VU 8.7 1.4 0.383 15.3
8 Vespe… Calo… <NA> Rode… <NA> 7 NA NA 17
9 Dog Canis carni Carn… DOMESTICATED 10.1 2.9 0.333 13.9
10 Roe d… Capr… herbi Arti… LC 3 NA NA 21
# ℹ 73 more rows
# ℹ 3 more variables: brainwt <dbl>, bodywt <dbl>, description <chr>
Return all rows where there are matches from both
R inner_join dplyr Package Function
left_join retains all rows of the data table, which is inserted first into the function (i.e. the X-data)
R left_join dplyr Package Function
right_join function retains all rows of the data on the right side (i.e. the Y-data). If you compare left join vs. right join, you can see that both functions are keeping the rows of the opposite data.
R right_join dplyr Package Function
full_join functions retains all rows of both input data sets and inserts NA when an ID is missing in one of the data frames.
R full_join dplyr Package Function
The next two join functions (i.e. semi_join and anti_join) are so called filtering joins. Filtering joins keep cases from the left data table (i.e. the X-data) and use the right data (i.e. the Y-data) as filter.
Wide: Data are often entered and stored in a “wide” format - where a subject’s characteristics or responses are stored in a single row. While this may be useful for presentation, it is not ideal for some types of analysis.
Model mpg cyl disp hp drat wt qsec vs am gear carb
1 Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
2 Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
3 Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
4 Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
5 Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
6 Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
7 Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
8 Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
9 Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
10 Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
11 Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
12 Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
13 Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
14 Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
15 Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
16 Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
17 Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
18 Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
19 Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
20 Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
21 Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
22 Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
23 AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
24 Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
25 Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
26 Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
27 Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
28 Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
29 Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
30 Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
31 Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
32 Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
mtcars_long <- pivot_longer(mtcars,
cols = mpg:carb,
names_to = "Category",
values_to = "Value"
)
mtcars_long# A tibble: 352 × 3
Model Category Value
<chr> <chr> <dbl>
1 Mazda RX4 mpg 21
2 Mazda RX4 cyl 6
3 Mazda RX4 disp 160
4 Mazda RX4 hp 110
5 Mazda RX4 drat 3.9
6 Mazda RX4 wt 2.62
7 Mazda RX4 qsec 16.5
8 Mazda RX4 vs 0
9 Mazda RX4 am 1
10 Mazda RX4 gear 4
# ℹ 342 more rows
# A tibble: 104 × 5
GEOID NAME variable estimate moe
<chr> <chr> <chr> <dbl> <dbl>
1 01 Alabama income 24476 136
2 01 Alabama rent 747 3
3 02 Alaska income 32940 508
4 02 Alaska rent 1200 13
5 04 Arizona income 27517 148
6 04 Arizona rent 972 4
7 05 Arkansas income 23789 165
8 05 Arkansas rent 709 5
9 06 California income 29454 109
10 06 California rent 1358 3
# ℹ 94 more rows
rent_income_wide <- pivot_wider(us_rent_income,
# id_cols = optional vector of unaffected columns,
names_from = c(variable),
values_from = c(estimate, moe),
names_sep = "_"
)
rent_income_wide# A tibble: 52 × 6
GEOID NAME estimate_income estimate_rent moe_income moe_rent
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 01 Alabama 24476 747 136 3
2 02 Alaska 32940 1200 508 13
3 04 Arizona 27517 972 148 4
4 05 Arkansas 23789 709 165 5
5 06 California 29454 1358 109 3
6 08 Colorado 32401 1125 109 5
7 09 Connecticut 35326 1123 195 5
8 10 Delaware 31560 1076 247 10
9 11 District of Columbia 43198 1424 681 17
10 12 Florida 25952 1077 70 3
# ℹ 42 more rows
tbi_age %>%
dplyr::group_by(age_group) %>%
dplyr::summarise(sum_num = sum(number_est, na.rm = TRUE))# A tibble: 11 × 2
age_group sum_num
<chr> <dbl>
1 0-17 836718
2 0-4 331403
3 15-24 476652
4 25-34 319303
5 35-44 240462
6 45-54 265417
7 5-14 348913
8 55-64 241511
9 65-74 211035
10 75+ 442377
11 Total 2877539
tbi_age %>%
dplyr::select(type, injury_mechanism, age_group, number_est) %>%
tidyr::pivot_wider(names_from = age_group, values_from = number_est)# A tibble: 21 × 13
type injury_mechanism `0-17` `0-4` `5-14` `15-24` `25-34` `35-44` `45-54`
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Emerge… Motor Vehicle C… 47138 5464 19785 103892 71641 44108 40020
2 Emerge… Unintentional F… 397190 230776 133084 96568 70210 68830 95127
3 Emerge… Unintentionally… 229236 53436 120839 106679 44404 32479 30495
4 Emerge… Other unintenti… 55785 12007 30656 37118 22360 17541 17808
5 Emerge… Intentional sel… NA NA NA 870 650 421 247
6 Emerge… Assault 24360 674 9690 65399 57213 34100 27682
7 Emerge… Other or no mec… 57983 19360 26022 33395 20974 16503 15962
8 Hospit… Motor Vehicle C… 5830 870 2395 12925 11050 7305 8490
9 Hospit… Unintentional F… 7935 4700 2270 3910 4470 5640 12010
10 Hospit… Unintentionally… 1985 510 980 1070 635 610 685
# ℹ 11 more rows
# ℹ 4 more variables: `55-64` <dbl>, `65-74` <dbl>, `75+` <dbl>, Total <dbl>
Standard print
Call:
lm(formula = mpg ~ wt, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-4.5432 -2.3647 -0.1252 1.4096 6.8727
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.2851 1.8776 19.858 < 2e-16 ***
wt -5.3445 0.5591 -9.559 1.29e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.046 on 30 degrees of freedom
Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10
Pretty broom/tidy print
Replacing for loops with more concise automation
[[1]]
[1] 1.2753262 1.2532608 1.5152521 0.9632092 1.1822736 1.9727132 1.2148138
[8] 1.1607379 0.8689666 0.5555470
[[2]]
[1] 1.9551079 1.3365568 1.2316640 2.3326519 3.1243529 2.3953717 4.1700423
[8] 0.5872410 0.2959162 1.7703495
[[3]]
[1] 2.2391719 2.9379073 5.2970526 3.7504805 4.2398555 0.9076744 3.0377226
[8] 4.3998811 2.1246664 2.9401894
[[4]]
[1] 4.627270 1.394210 2.458875 3.188346 3.132968 4.087950 3.478214 3.277804
[9] 4.090909 3.935028
[[5]]
[1] 4.628467 4.919631 6.993296 4.567888 4.345368 5.459781 4.528624 5.830616
[9] 6.160098 3.666932
[[6]]
[1] 5.999579 5.897450 6.247411 4.400042 4.787289 6.347242 6.353750 5.681297
[9] 5.705933 6.196801
[[7]]
[1] 6.798277 6.705364 7.070292 6.843013 7.111320 7.556265 5.455063 5.737431
[9] 6.483021 7.262715
[[8]]
[1] 8.020697 7.250144 8.190980 7.393124 9.294098 8.168119 7.507259 7.957210
[9] 9.648788 7.040450
[[9]]
[1] 10.052380 10.180445 9.223869 8.939108 8.378289 8.416154 8.018666
[8] 7.537818 8.970469 9.238761
[[10]]
[1] 10.122628 9.453996 8.297895 9.534022 9.718270 8.616959 9.780240
[8] 9.594422 9.949405 9.301528
Multiple vectors
# generate volume function
volume <- function(diameter, height){
# convert diameter in inches to raidus in feet
radius_ft <- (diameter/2)/12
# calculate volume
output <- pi * radius_ft^2 * height
return(output)
}
map2_dbl(trees$Girth, trees$Height, volume) [1] 26.30157 26.22030 26.60929 43.29507 50.58013 52.80232 43.55687
[8] 49.49645 53.76050 51.31268 55.01883 53.87046 53.87046 51.51672
[15] 58.90486 67.16431 77.14819 82.97153 72.68200 66.47610 83.38311
[22] 87.98205 84.85845 100.53096 111.58179 132.22227 136.96744 139.80524
[29] 141.37167 141.37167 201.36365